Anomaly Detection: Double-Stochastic vs. Multi-Dist

We compare the two algorithms first on the basis of statistical mutual agreement and consistency, in the absence of true anomaly labels, and subsequently on the basis of visual interpretation. The idea is to examine the differences in detection between the two schemes across common time-frames and network entities (e.g. CMTS), prioritised by anomaly intensity/score: in order of anomaly intensity from each detection scheme, which detected periods and entities are common, and conversely, where do the methods differ?

We then plot the top-k detected time-series to obtain a visual representation and to surface differences between the two methods. This process not only reveals the strengths and weaknesses of each algorithm, but also lets us anticipate the customer reaction should we move to a new algorithm.

In [1]:
import pandas as pd
import seaborn as sns
sns.set_palette("muted")
pd.set_option("display.width",150)
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
%%time
np.random.seed(5)
header = ["call_date_time","call_lob_data","call_lob_voice","call_lob_video","network_level_3_name",\
          "network_level_9_name","device_data_model","device_voice_model","device_video_model","ticket_id"]
calls = pd.read_csv("data/call_dump",sep="|",names=header,usecols=range(1,len(header)+1))
calls['call_date_time'] = calls['call_date_time'].convert_objects(convert_numeric=True).dropna().astype("int").astype("str")
for col in header:
    calls[col] = calls[col].astype("str").map(str.strip).replace({"NULL":np.nan,"nan":np.nan},regex=True)
calls = calls.dropna(subset=['call_date_time'])
calls['date_hour'] = pd.DatetimeIndex(pd.to_datetime(calls['call_date_time'].str[:10],format="%Y%m%d%H")).tz_localize("UTC").tz_convert("US/Pacific-New")
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:4: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
CPU times: user 5min 50s, sys: 6.35 s, total: 5min 56s
Wall time: 5min 57s

The Comcast network follows a hierarchy with a single parent per child:

  • level_1: division
  • level_2: region
  • level_3: system
  • level_8: headend
  • level_9: cmts

We group calls by the following variables to gather hourly call-counts per level in a time-series for the entire duration:

  • Network Level 3: SYSTEM
  • Network Level 9: CMTS
  • Network Level 9 + Device Model: Voice/Video/Data

Next, we take each time-series and score it using the double-stochastic detection scheme introduced earlier. Multi-dist scores come from a pre-scored anomaly table, where each detected instance is graded into one of three levels via the "severity" column, representing 'low', 'medium', and 'high' intensity anomalies.

In [3]:
start,end = '2016-04-01 00:00:00','2016-07-22 00:00:00'
timeframe = pd.date_range(start=start,end=end,freq="H").tz_localize("US/Pacific-New")
period = timeframe[-344:]
In [6]:
from ipyparallel import Client
import os
import time
os.system("nohup ipcluster start -n 24 &")
time.sleep(5)
rc = Client()
print rc.ids
dview = rc[:]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

Here is the distribution of the mean number of calls (X) against the number of levels of the variable (Y). The plots below show that level=SYSTEM has higher call-arrival rates on average than level=CMTS. Note that only those calls are counted where call_is_deflected=False (i.e. the call was not deflected) and the time between consecutive calls for a subscriber exceeds 24 hours (using chain_time_lag). We observe that this filter leads to matching event-counts between the enriched_anomaly and enriched_call tables.
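The filter described above can be sketched as follows. This is a hypothetical illustration on toy data: the column names `call_is_deflected` and `chain_time_lag` (hours since the subscriber's previous call) are assumptions, since in this notebook the filtering is already applied upstream in the `data/call_dump` extract.

```python
import pandas as pd
import numpy as np

# Toy call records; NaN chain_time_lag marks a subscriber's first call.
toy = pd.DataFrame({
    "subscriber":        ["a", "a", "b", "c"],
    "call_is_deflected": [False, False, False, True],
    "chain_time_lag":    [np.nan, 3.0, 48.0, 30.0],
})

# Keep non-deflected calls whose gap to the previous call exceeds 24 hours
# (first calls, which have no previous call, are kept as well).
kept = toy[(~toy["call_is_deflected"]) &
           (toy["chain_time_lag"].isnull() | (toy["chain_time_lag"] > 24))]
print(len(kept))  # 2: subscriber a's first call and subscriber b's call
```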

In [7]:
%%time
from score import score
series = ['network_level_3_name','network_level_9_name']
fig,axes = plt.subplots(1,2,figsize=(15,5))
i=0
out = {}
for var in series:
    sup = calls.groupby(["date_hour",var])['call_date_time'].count().unstack().loc[timeframe].fillna(0)
    sup.mean().hist(bins=15,ax=axes[i])
    axes[i].set_xlabel("mean #calls"), axes[i].set_ylabel("#levels in %s"%var)
    i+=1
    out[var] = dict(zip(sup.columns,dview.map_sync(score, [sup[col] for col in sup.columns], [period]*len(sup.columns))))
CPU times: user 36.1 s, sys: 6.1 s, total: 42.2 s
Wall time: 7min 4s

Next, we extend the detection to combinations of two variables, namely device video/voice/data models and CMTS (network_level_9). Only calls belonging to the LOB corresponding to the device type are considered in the counts. This yields time-series with even lower event-counts, allowing us to analyse the detection sensitivity of the new scheme on data containing many zeros. Here is the distribution of mean #calls per level of the combination:

In [8]:
%%time
devices = ['voice','video','data']
levels = ['device_%s_model','network_level_9_name']
fig,axes = plt.subplots(1,3,figsize=(17,5))
i=0
for device in devices:
    sup = calls[calls['call_lob_%s'%device]=="true"].groupby(["date_hour",levels[0]%device,levels[1]])['call_date_time'].count().unstack([1,2]).loc[timeframe].fillna(0)
    sup = sup.loc[:,sup.max()>=5]
    sup.mean().hist(bins=15,ax=axes[i])
    axes[i].set_xlabel("mean #calls"), axes[i].set_ylabel("#levels in %s + network_level_9"%device)
    i+=1
    print "# of levels with no less than 5 max. calls for device type %s: %s"%(device,sup.shape[1])
    out[device+"_level_9"] = dict(zip(sup.columns,dview.map_sync(score, [sup[col] for col in sup.columns], [period]*len(sup.columns))))
# of levels with no less than 5 max. calls for device type voice: 382
# of levels with no less than 5 max. calls for device type video: 2121
# of levels with no less than 5 max. calls for device type data: 1165
CPU times: user 1min 14s, sys: 12.7 s, total: 1min 27s
Wall time: 8min 8s

Evaluation

A ranking model's purpose is to rank, i.e. to produce a permutation of items in new, unseen lists that is "similar" to the rankings in the training data. We choose a common duration of time (14 days: 2016-07-09 00:00 -> 2016-07-21 23:00) and rank the detections from both methods (the precomputed "severity" column for multi-dist and the scores computed above for double-stochastic) to find the most anomalous time-series, considering all levels in SYSTEM and CMTS from each method.

The analysis is extended to include statistical measures of comparison, developing a quantitative characterisation of each method's relative performance in addition to purely qualitative visual interpretation. The following measures are chosen to compare the algorithms:

Precision & Recall: are used to measure the detection performance of double-stochastic. These are computed on the double-stochastic scores under the assumption that all detections from multi-dist (any severity: 1, 2, 3) represent true anomalies. This gives a baseline comparison criterion that quantifies consistency between the two methods in a simple yes/no fashion.
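The yes/no comparison above amounts to binarising multi-dist severity and thresholding the double-stochastic scores. A minimal sketch with made-up numbers (not drawn from the analysis):

```python
import numpy as np

# Toy per-instance data: multi-dist severity grades and double-stochastic scores.
severity = np.array([0, 0, 1, 3, 2, 0])
ds_score = np.array([5., 40., 30., 90., 10., 8.])
cut = 25  # illustrative double-stochastic cutoff

y_true = severity > 0     # multi-dist detections treated as "true" anomalies
y_pred = ds_score > cut   # double-stochastic flags

tp = np.sum(y_pred & y_true)            # instances flagged by both methods
precision = tp / float(np.sum(y_pred))  # fraction of DS flags that agree
recall = tp / float(np.sum(y_true))     # fraction of multi-dist detections caught
print(precision, recall)
```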

Discounted cumulative gain: (DCG) is a measure of ranking quality, often used to evaluate the effectiveness of web-search algorithms. Using a graded relevance scale, DCG measures the usefulness (or gain) of a result based on its position: the gain is accumulated from the top of the result list to the bottom, with each rank discounted. The DCG accumulated at a particular rank position p is defined as:

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$

Two assumptions are made in using DCG and its related measures for evaluating anomaly detection.

* Alarms can be graded into severity levels representing true relevance (multi-dist severity).
* Severe alarms are more useful when ranked higher (from double-stochastic scores).

The cumulative gain at each position is normalized across detection levels by sorting all alarms by their relevance, producing the maximum possible DCG through position p, also called the Ideal DCG (IDCG) through that position. The normalized discounted cumulative gain, or nDCG, is then computed as:

$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$
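As a quick numeric check of the formulas (made-up relevances, not drawn from the data): suppose three alarms are returned in an order with graded relevances $rel = (3, 1, 2)$. Then

$$\mathrm{DCG}_3 = \frac{2^3-1}{\log_2 2} + \frac{2^1-1}{\log_2 3} + \frac{2^2-1}{\log_2 4} = 7 + 0.63 + 1.5 \approx 9.13,$$

while the ideal ordering $(3, 2, 1)$ gives $\mathrm{IDCG}_3 = 7 + 3/\log_2 3 + 0.5 \approx 9.39$, so $\mathrm{nDCG}_3 \approx 0.97$.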
In [9]:
header = ["timebin","event_count","severity","rating","ranking","feature1","value1","feature2","value2","call_lob_data","call_lob_video","call_lob_voice","device_video_model","device_data_model","device_voice_model"]
prod_anomaly = pd.read_csv("data/anomaly_dump",sep="|",skiprows=3,usecols=[3,6,8,9,10,14,15,16,17,18,19,20,26,30,34],names=header)
for col in header:
    prod_anomaly[col] = prod_anomaly[col].astype("str").map(str.strip).replace({"NULL":np.nan,"nan":np.nan},regex=True)
prod_anomaly = prod_anomaly.dropna(subset=['timebin'])
prod_anomaly = prod_anomaly.convert_objects(convert_numeric=True)
prod_anomaly['timebin'] = prod_anomaly['timebin'].dropna().astype("int").astype("str")
prod_anomaly['date_hour'] = pd.DatetimeIndex(pd.to_datetime(prod_anomaly['timebin'].str[:10],format="%Y%m%d%H"))
prod_anomaly['minutes'] = prod_anomaly['timebin'].str[10:12]
prod_anomaly = prod_anomaly[prod_anomaly['severity']!="severity"]  # drop embedded header rows from the dump
prod_anomaly['severity'] = prod_anomaly['severity'].replace({"LOW":1,"MEDIUM":2,"HIGH":3}).astype("float64")
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:6: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
In [10]:
def dcg_score(y_true, y_score, k=10, gains="exponential"):
    # Sort by predicted score (descending) and keep the true relevances
    # of the top-k items.
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order[:k])
    if gains == "exponential": gains = 2 ** y_true - 1
    elif gains == "linear": gains = y_true
    else: raise ValueError("Invalid gains option.")
    # Discount position i (0-based) by log2(i + 2).
    discounts = np.log2(np.arange(len(y_true)) + 2)
    return np.sum(gains / discounts)

def ndcg_score(y_true, y_score, k=10, gains="exponential"):
    # Normalise by the ideal DCG, obtained by ranking on the true relevances.
    best = dcg_score(y_true, y_true, k, gains)
    actual = dcg_score(y_true, y_score, k, gains)
    return actual / best

def compute_baseline(timeseries):
    # Weekly-seasonal baseline: slice the series into one segment per
    # (day-of-week, hour) slot and take the rolling median of the last 5
    # observations in each slot, shifted so only past values contribute.
    seg = []
    for day in range(7):
        for hour in range(24):
            sub = timeseries[timeseries.index.dayofweek==day]
            sub = sub[sub.index.hour==hour]
            seg.append(sub)
    baseline = pd.concat([pd.rolling_median(sub,5).shift(1) for sub in seg]).sort_index().fillna(0)
    return baseline
In [11]:
for key in out.keys(): 
    out[key] = pd.concat(out[key],axis=1).sort_index()
    out[key].index = out[key].index.tz_localize(None)
In [12]:
cut = 15
detection = {"device+level_9":[],"network_level_3_name":[],"network_level_9_name":[]}

start,end = '2016-07-09 00:00:00','2016-07-21 23:00:00'
time_range = pd.date_range(start=start,end=end,freq="H")
for device in devices:
    sup = calls[calls['call_lob_%s'%device]=="true"].groupby(["date_hour",levels[0]%device,levels[1]])['call_date_time'].count().unstack([1,2]).loc[timeframe].fillna(0)
    anom = []
    if sup.index.tz is not None: sup.index = sup.index.tz_convert("UTC").tz_localize(None)
    for time in time_range:
        temp = out[device+"_level_9"].xs("double_stochastic",axis=1,level=2).loc[time].dropna()
        anom.append(pd.DataFrame(temp))
    detected = pd.concat(anom).stack().groupby(level=[0,1,2]).max()
    prod_detected = prod_anomaly[prod_anomaly['minutes']=='00'][prod_anomaly['date_hour']>=time_range[0]][prod_anomaly['date_hour']<time_range[-1]][prod_anomaly['feature1']==str.upper(levels[0]%device)][prod_anomaly['feature2']==str.upper(levels[1])].groupby(['value1','value2','date_hour'])['severity'].max()
    temp = pd.concat({"double_stochastic":detected[detected>cut],"multi_dist":prod_detected},axis=1).dropna(how="all").fillna({"double_stochastic":cut,"multi_dist":0})
    temp.loc[temp['double_stochastic']>500,"double_stochastic"] = 500
    temp.index = pd.MultiIndex.from_tuples(zip(temp.index.get_level_values(0)+"_%s"%device,temp.index.get_level_values(1),temp.index.get_level_values(2)))
    temp['anomaly_multi_dist'] = 0
    temp.loc[temp["multi_dist"]>0,"anomaly_multi_dist"] = 1
    detection["device+level_9"].append(temp)

detection['device+level_9'] = pd.concat(detection['device+level_9'])

for var in series:
    anom = []
    sup = calls.groupby(["date_hour",var])['call_date_time'].count().unstack().loc[timeframe].fillna(0)
    if sup.index.tz is not None: sup.index = sup.index.tz_convert("UTC").tz_localize(None)
    for time in time_range:
        temp = out[var].xs("double_stochastic",axis=1,level=1).loc[time]
        anom.append(pd.DataFrame(temp))
    detected = pd.concat(anom).stack().groupby(level=[0,1]).max()
    prod_detected = prod_anomaly[prod_anomaly['minutes']=='00'][prod_anomaly['date_hour']>=time_range[0]][prod_anomaly['date_hour']<time_range[-1]][prod_anomaly['feature1']==str.upper(var)][prod_anomaly['feature2'].isnull()].groupby(['value1','date_hour'])['severity'].max()
    temp = pd.concat({"double_stochastic":detected[detected>cut],"multi_dist":prod_detected},axis=1).dropna(how="all").fillna({"double_stochastic":cut,"multi_dist":0})
    temp.loc[temp['double_stochastic']>500,"double_stochastic"] = 500
    temp['anomaly_multi_dist'] = 0
    temp.loc[temp["multi_dist"]>0,"anomaly_multi_dist"] = 1
    detection[var] = temp
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:14: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:33: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

Here is a box-plot of double-stochastic scores per level of multi-dist severity; it clearly demarcates the severity levels from an agreement/consistency point of view.

In [13]:
fig,axes = plt.subplots(1,3,figsize=(15,5))
fig.text(0.5, 0.01, 'severity multi_dist', ha='center', va='center'), fig.tight_layout()
i=0
for var in detection.keys():
    sns.boxplot(x="multi_dist", y="double_stochastic", whis=[25,75],data=detection[var],ax=axes[i])
    axes[i].legend(labels=[var]),axes[i].set(yscale="log",ylim=[10,np.percentile(detection[var]['double_stochastic'],99)],xlabel='',ylabel='')
    i+=1
axes[0].set_ylabel("double_stochastic")
Out[13]:
<matplotlib.text.Text at 0x7fd6acc20810>
In [14]:
from sklearn.metrics import precision_recall_curve
ks = np.arange(1,11)*10
fig,axes = plt.subplots(1,2,figsize=(15,6))
dcg = {}
for var in detection.keys():
    pr,re,th = precision_recall_curve(detection[var]['anomaly_multi_dist'].values,detection[var]['double_stochastic'].values)
    axes[0].plot(re[pr>pr[argmax(re)]],pr[pr>pr[argmax(re)]],label=var,lw=3), axes[0].set_xlabel("Recall"), axes[0].set_ylabel("Precision")
    dcg[var] = pd.Series(index=ks,data=np.array([ndcg_score(detection[var]['multi_dist'].values,detection[var]['double_stochastic'].values,k=k) for k in ks]))
    
pd.concat(dcg,axis=1)[detection.keys()].plot(kind="bar",ax=axes[1])
axes[1].legend(loc="upper right",bbox_to_anchor=(1.0,1.2)), axes[1].set_xlabel("top k alarms"), axes[1].set_ylabel("nDCG@k")
fig.tight_layout()

Next, to highlight the agreement/consistency of the two detection schemes, we train a statistical model expressing one as a monotonic function of the other. The underlying goal is to transform the scores into a "model" and "residual" space where each instance can be quantified by its "dissimilarity" under the two methods; the advantage is that the most dissimilar instances are prioritised for further analysis.

Proportional Odds Model: is a regression model for ordinal dependent variables, for example a choice among "poor", "fair", "good", "very good", and "excellent", given a set of predictors. It can be thought of as an extension of the binary logistic regression model, allowing more than two (ordered) response categories.
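The proportional-odds link can be sketched as follows. This is an illustrative toy with made-up parameters (the cutpoints `theta` and slope `beta` are assumptions, not the fitted values): the model posits P(severity <= j | x) = logistic(theta_j - beta * x) with one shared slope and ordered cutpoints, here with x the log double-stochastic score.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([2.0, 3.5, 5.0])  # cutpoints between severities 0|1, 1|2, 2|3
beta = 1.2                          # shared slope on log(double_stochastic)

def severity_probs(score):
    x = np.log(score)
    cum = sigmoid(theta - beta * x)               # P(severity <= j), j = 0, 1, 2
    cum = np.concatenate([cum, [1.0]])            # P(severity <= 3) = 1
    return np.diff(np.concatenate([[0.0], cum]))  # per-category probabilities

# Higher double-stochastic scores shift probability mass toward higher severity.
print(severity_probs(50.0).round(3))
```

Because `beta` is shared across cutpoints, every severity threshold responds to the score in the same monotonic way, which is exactly the constraint that makes the model useful for our comparison.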

We predict the ordered multi-dist severity levels using the double-stochastic scores as the predictor. In this model the log-odds are assumed linear in the predictors, so all relationships are constrained to be monotonic (linear, to be precise), a desirable attribute for our comparative purposes. The fundamental goal is to account for the variance explained between the two methods and then transform the scores into a model+residual space, where overall agreement is quantified using measures like Mean Absolute Error per ordinal level, and inconsistency is measured in the residuals.

In [15]:
import readline
import rpy2
%load_ext rpy2.ipython
In [16]:
%R require("MASS")
%R require("regr0")
%R i<-1
for key in detection.keys():
    temp = detection[key][['multi_dist','double_stochastic']].copy()
    %Rpush temp
    %R new <- data.frame(double_stochastic=as.numeric(temp[,'double_stochastic']),multi_dist=as.factor(temp[,'multi_dist']))
    %R logreg <- paste("logreg_", i, sep = "")
    %R assign(logreg, polr(multi_dist~log(double_stochastic), data=new, Hess=TRUE))
    %R resid_logreg <- residuals(eval(parse(text=logreg)))[,1]
    %R prediction <- predict(eval(parse(text=logreg)))
    %Rpull resid_logreg
    %Rpull prediction
    %R i=i+1
    detection[key]['residuals'] = resid_logreg
    detection[key]['prediction'] = prediction.astype("int64")
/home/anant/anaconda2/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Loading required package: MASS

  warnings.warn(x, RRuntimeWarning)
/home/anant/anaconda2/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: Loading required package: regr0

  warnings.warn(x, RRuntimeWarning)
/home/anant/anaconda2/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: 
Attaching package: ‘regr0’


  warnings.warn(x, RRuntimeWarning)
/home/anant/anaconda2/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: The following object is masked from ‘package:stats’:

    step


  warnings.warn(x, RRuntimeWarning)
/home/anant/anaconda2/lib/python2.7/site-packages/rpy2/rinterface/__init__.py:185: RRuntimeWarning: The following object is masked from ‘package:base’:

    subset


  warnings.warn(x, RRuntimeWarning)
In [17]:
%%R -h 350 -w 1000
par(mfrow=c(1,3))
for(i in 1:3){
    logreg <- paste("logreg_", i, sep = "")
    plTA.polr(eval(parse(text=logreg)))
}
In [18]:
fig,axes = plt.subplots(1,3,figsize=(19,5))
i=0
for col in detection.keys():
    groups = detection[col].groupby("multi_dist").groups
    mae,sev = [],[]
    for severity,index in groups.iteritems():
        mae.append(np.abs(detection[col].loc[index]['prediction'] - severity).mean()), sev.append(severity)
    axes[i].bar(sev,mae), axes[i].set(ylabel="Mean Absolute Error",xlabel="severity_multi_dist",title=col,xticks=[0,1,2,3],xticklabels=['none','low','med','high'])
    i+=1

Finally, we plot the top-ranked detections by inconsistency, as measured in the residual space obtained from fitting the proportional odds model. This is done for each network variable (SYSTEM, CMTS, CMTS+Device) taken individually over the chosen duration, along with a baseline plot to aid visual representation and qualitative analysis. The plots, ranked by degree of inconsistency, fall into two categories: one where double-stochastic scores are relatively lower than multi-dist severity, and vice-versa.

In [19]:
cut = 25
from IPython.display import display
cols = ['multi_dist','double_stochastic','residuals']
for col in detection.keys():
    if col != "device+level_9":
        levels = detection[col].sort("residuals")[-10:].index.get_level_values(0)
        for level in set(levels):
            sup = calls[calls[col]==level].groupby('date_hour')['call_date_time'].count().loc[timeframe].fillna(0)
            if sup.index.tz is not None: sup.index = sup.index.tz_convert("UTC").tz_localize(None)
            fig = plt.figure(figsize=(15,4))
            detected = detection[col].xs(level,level=0)
            times_DS,intensity_DS = detected[detected['double_stochastic']>cut].sort("double_stochastic")[-5:].index, detected[detected['double_stochastic']>cut]['double_stochastic'].order()[-5:].values.tolist()
            times_MD,intensity_MD = detected[detected['multi_dist']>0].index, detected[detected['multi_dist']>0]['multi_dist'].values.tolist()
            sup.loc[period.tz_convert("UTC").tz_localize(None)].plot(label="observed")
            compute_baseline(sup).loc[period.tz_convert("UTC").tz_localize(None)].plot(label="baseline")
            plt.scatter(list(times_DS),sup.ix[times_DS].values,c='r',marker="*",s=70,alpha=0.6, label="Double-Stochastic: %s"%np.round(intensity_DS))
            plt.scatter(list(times_MD),sup.ix[times_MD].values,c='g',marker="^",s=70,alpha=0.6, label="Multi-Dist: %s"%intensity_MD)
            plt.suptitle(col+": "+level), plt.legend()
    else:
        temp = detection[col].sort("residuals")[-10:].index
        levels = [tuple(mod[0].split("_")+[mod[1]]) for mod in temp]    
        for level in set(levels):
            sup = calls[calls["device_%s_model"%level[1]]==level[0]][calls['call_lob_%s'%level[1]]=="true"][calls['network_level_9_name']==level[2]].groupby('date_hour')['call_date_time'].count().loc[timeframe].fillna(0)
            if sup.index.tz is not None: sup.index = sup.index.tz_convert("UTC").tz_localize(None)
            fig = plt.figure(figsize=(15,4))
            detected = detection[col].xs("%s_%s"%(level[0],level[1]),level=0).xs(level[2],level=0)
            times_DS,intensity_DS = detected[detected['double_stochastic']>cut].sort("double_stochastic")[-5:].index, detected[detected['double_stochastic']>cut]['double_stochastic'].order()[-5:].values.tolist()
            times_MD,intensity_MD = detected[detected['multi_dist']>0].index, detected[detected['multi_dist']>0]['multi_dist'].values.tolist()
            sup.loc[period.tz_convert("UTC").tz_localize(None)].plot(label="observed")
            compute_baseline(sup).loc[period.tz_convert("UTC").tz_localize(None)].plot(label="baseline")
            plt.scatter(list(times_DS),sup.ix[times_DS].values,c='r',marker="*",s=70,alpha=0.6, label="Double-Stochastic: %s"%np.round(intensity_DS))
            plt.scatter(list(times_MD),sup.ix[times_MD].values,c='g',marker="^",s=70,alpha=0.6, label="Multi-Dist: %s"%intensity_MD)
            plt.suptitle(col+": %s, %s, %s"%(level[1],level[0],level[2])), plt.legend()
    print "============================ %s =========================== \n"%col
    display(detection[col][cols].sort("residuals")[-10:])
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:6: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:12: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:12: FutureWarning: order is deprecated, use sort_values(...)
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:22: FutureWarning: pd.rolling_median is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=5,center=False).median()
============================ network_level_3_name =========================== 

/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:35: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
multi_dist double_stochastic residuals
beltway central 2016-07-16 06:00:00 3.0 101.304556 2.345301
jacksonville, fl 2016-07-16 00:00:00 3.0 58.892146 2.854919
beltway central 2016-07-16 11:00:00 3.0 56.475580 2.895084
jacksonville, fl 2016-07-15 23:00:00 3.0 52.891236 2.958129
wnes 2016-07-16 11:00:00 3.0 36.099140 3.329340
fl panhandle 2016-07-10 22:00:00 3.0 29.422551 3.530361
wnes 2016-07-16 19:00:00 3.0 18.094211 4.012787
west michigan 2016-07-10 12:00:00 3.0 17.343168 4.055093
wnes 2016-07-16 18:00:00 3.0 15.000000 4.200173
fl panhandle 2016-07-11 12:00:00 3.0 15.000000 4.200173
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:20: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:23: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:27: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:27: FutureWarning: order is deprecated, use sort_values(...)
============================ device+level_9 =========================== 

multi_dist double_stochastic residuals
tg862g (arris)_voice ten03.everett.wa.seattle.comcast.net 2016-07-21 21:00:00 2.0 15.000000 3.369125
ten01.wolcott.ct.hartford.comcast.net 2016-07-18 14:00:00 2.0 15.000000 3.369125
tm608g (arris)_voice cbr01.hamden.ct.hartford.comcast.net 2016-07-18 13:00:00 2.0 15.000000 3.369125
tg1682g (arris)_voice ten01.wolcott.ct.hartford.comcast.net 2016-07-18 14:00:00 2.0 15.000000 3.369125
ten01.shelton.ct.hartford.comcast.net 2016-07-18 14:00:00 2.0 15.000000 3.369125
tm608g (arris)_voice ten01.wolcott.ct.hartford.comcast.net 2016-07-18 14:00:00 2.0 15.000000 3.369125
tc8305c (technicolor)_voice ten01.shelton.ct.hartford.comcast.net 2016-07-18 14:00:00 2.0 15.000000 3.369125
tg862g (arris)_voice ten04.everett.wa.seattle.comcast.net 2016-07-21 21:00:00 2.0 15.000000 3.369125
dpc3941t (cisco)_voice ten01.wolcott.ct.hartford.comcast.net 2016-07-18 14:00:00 3.0 16.920831 4.685305
dpc3939 (cisco)_voice cbr01.naugatuck.ct.hartford.comcast.net 2016-07-18 14:00:00 3.0 16.815285 4.703607
/home/anant/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py:524: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
============================ network_level_9_name =========================== 

multi_dist double_stochastic residuals
acr02.danville.va.richmond.comcast.net 2016-07-14 12:00:00 3.0 25.887488 2.932370
ten02.eastshrvprt.la.malt.comcast.net 2016-07-10 22:00:00 3.0 24.840203 3.051249
ten04.sandia.nm.albuq.comcast.net 2016-07-20 02:00:00 3.0 24.568615 3.083019
ten04.bellevue.wa.seattle.comcast.net 2016-07-11 18:00:00 3.0 24.075507 3.141736
ten01.leaguecity.tx.houston.comcast.net 2016-07-12 21:00:00 3.0 21.756304 3.437255
ten04.westfield.ma.boston.comcast.net 2016-07-20 11:00:00 3.0 20.181309 3.658449
ten01.leaguecity.tx.houston.comcast.net 2016-07-12 23:00:00 3.0 15.000000 4.542342
ten03.sharpsridge.tn.knox.comcast.net 2016-07-15 02:00:00 3.0 15.000000 4.542342
ten01.nashua.nh.boston.comcast.net 2016-07-10 13:00:00 3.0 15.000000 4.542342
acr03.union.nj.panjde.comcast.net 2016-07-13 02:00:00 3.0 15.000000 4.542342

The above plots depict instances where multi-dist assigns high severity (2, 3) whereas double-stochastic assigns lower scores. On visual inspection, we observe notable differences in relative detection performance among the most dissimilar instances. Most of these observations, although detected by both methods, are assigned vastly different degrees of severity. For instance, in multiple cases where the observed #calls ~ 10 over a baseline of #calls ~ 0, multi-dist assigns severity=3, signifying an extreme outlier, while double-stochastic is conservative in flagging such observations.

In [20]:
for col in detection.keys():
    if col != "device+level_9":
        levels = detection[col].sort("residuals")[:10].index.get_level_values(0)
        for level in set(levels):
            sup = calls[calls[col]==level].groupby('date_hour')['call_date_time'].count().loc[timeframe].fillna(0)
            if sup.index.tz is not None: sup.index = sup.index.tz_convert("UTC").tz_localize(None)
            fig = plt.figure(figsize=(15,4))
            detected = detection[col].xs(level,level=0)
            times_DS,intensity_DS = detected[detected['double_stochastic']>cut].index, detected[detected['double_stochastic']>cut]['double_stochastic'].values.tolist()
            times_MD,intensity_MD = detected[detected['multi_dist']>0].index, detected[detected['multi_dist']>0]['multi_dist'].values.tolist()
            sup.loc[period.tz_convert("UTC").tz_localize(None)].plot(label="observed")
            compute_baseline(sup).loc[period.tz_convert("UTC").tz_localize(None)].plot(label="baseline")
            plt.scatter(list(times_DS),sup.ix[times_DS].values,c='r',marker="*",s=70,alpha=0.6, label="Double-Stochastic: %s"%np.round(intensity_DS))
            plt.scatter(list(times_MD),sup.ix[times_MD].values,c='g',marker="^",s=70,alpha=0.6, label="Multi-Dist: %s"%intensity_MD)
            plt.suptitle(col+": "+level), plt.legend()
    else:
        temp = detection[col].sort("residuals")[:10].index
        levels = [tuple(mod[0].split("_")+[mod[1]]) for mod in temp]    
        for level in set(levels):
            sup = calls[calls["device_%s_model"%level[1]]==level[0]][calls['call_lob_%s'%level[1]]=="true"][calls['network_level_9_name']==level[2]].groupby('date_hour')['call_date_time'].count().loc[timeframe].fillna(0)
            if sup.index.tz is not None: sup.index = sup.index.tz_convert("UTC").tz_localize(None)
            fig = plt.figure(figsize=(15,4))
            detected = detection[col].xs("%s_%s"%(level[0],level[1]),level=0).xs(level[2],level=0)
            times_DS,intensity_DS = detected[detected['double_stochastic']>cut].index, detected[detected['double_stochastic']>cut]['double_stochastic'].values.tolist()
            times_MD,intensity_MD = detected[detected['multi_dist']>0].index, detected[detected['multi_dist']>0]['multi_dist'].values.tolist()
            sup.loc[period.tz_convert("UTC").tz_localize(None)].plot(label="observed")
            compute_baseline(sup).loc[period.tz_convert("UTC").tz_localize(None)].plot(label="baseline")
            plt.scatter(list(times_DS),sup.ix[times_DS].values,c='r',marker="*",s=70,alpha=0.6, label="Double-Stochastic: %s"%np.round(intensity_DS))
            plt.scatter(list(times_MD),sup.ix[times_MD].values,c='g',marker="^",s=70,alpha=0.6, label="Multi-Dist: %s"%intensity_MD)
            plt.suptitle(col+": %s, %s, %s"%(level[1],level[0],level[2])), plt.legend()
    print "============================ %s =========================== \n"%col
    display(detection[col][cols].sort("residuals")[:10])
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  app.launch_new_instance()
/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:22: FutureWarning: pd.rolling_median is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=5,center=False).median()
============================ network_level_3_name =========================== 

/home/anant/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:32: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
multi_dist double_stochastic residuals
northern new jersey 2016-07-12 23:00:00 0.0 240.190217 -2.317442
wnen 2016-07-16 07:00:00 0.0 193.611601 -2.122270
gbrn 2016-07-16 07:00:00 0.0 149.953103 -1.897553
noco 2016-07-20 00:00:00 0.0 118.520103 -1.698247
knoxville 2016-07-14 20:00:00 0.0 101.823170 -1.574052
wnen 2016-07-20 08:00:00 0.0 88.469514 -1.462549
nashville 2016-07-12 17:00:00 0.0 88.380196 -1.461761
gbrs 2016-07-14 06:00:00 0.0 84.829049 -1.429912
wnen 2016-07-20 09:00:00 0.0 77.570106 -1.361546
denver metro 2016-07-14 07:00:00 0.0 76.288665 -1.348987
============================ device+level_9 =========================== 

device                      network_level_9_name                   date_hour            multi_dist  double_stochastic  residuals
dci105com1 (thomson)_video  cbr01.mainshrvprt.la.malt.comcast.net  2016-07-10 20:00:00         0.0          60.833617  -2.672095
pcrng110 (pace)_video       acr03.union.nj.panjde.comcast.net      2016-07-12 23:00:00         0.0          46.120313  -1.938543
dci105com1 (thomson)_video  cbr01.mainshrvprt.la.malt.comcast.net  2016-07-16 01:00:00         0.0          43.302733  -1.781475
tg862g (arris)_voice        acr03.union.nj.panjde.comcast.net      2016-07-12 23:00:00         0.0          40.382463  -1.613195
pcrng150bnmd (pace)_video   acr03.union.nj.panjde.comcast.net      2016-07-12 23:00:00         0.0          35.963490  -1.349243
tc8305c (technicolor)_data  ten03.everett.wa.seattle.comcast.net   2016-07-21 21:00:00         0.0          35.725674  -1.334777
dpc3941t (cisco)_data       ten04.everett.wa.seattle.comcast.net   2016-07-21 21:00:00         0.0          35.453967  -1.318222
sarng100 (cisco)_video      cbr01.mainshrvprt.la.malt.comcast.net  2016-07-10 20:00:00         0.0          34.737509  -1.274447
dc50xum (pace)_video        ten01.westshrvprt.la.malt.comcast.net  2016-07-16 01:00:00         0.0          34.124301  -1.236853
dpc3941t (cisco)_data       ten03.everett.wa.seattle.comcast.net   2016-07-21 21:00:00         0.0          32.736264  -1.151433
============================ network_level_9_name =========================== 

network_level_9_name                    date_hour            multi_dist  double_stochastic  residuals
cbr01.mainshrvprt.la.malt.comcast.net   2016-07-10 20:00:00         0.0         179.025311  -6.793453
ten01.westshrvprt.la.malt.comcast.net   2016-07-10 20:00:00         0.0          88.120413  -4.653510
cbr01.mainshrvprt.la.malt.comcast.net   2016-07-10 21:00:00         0.0          79.785254  -4.355683
acr02.aspen.co.denver.comcast.net       2016-07-20 00:00:00         0.0          73.199616  -4.098395
ten01.northshrvprt.la.malt.comcast.net  2016-07-16 01:00:00         0.0          70.519506  -3.987323
acr01.frenchcamp.ca.ccal.comcast.net    2016-07-14 10:00:00         0.0          68.991350  -3.922194
ten01.northshrvprt.la.malt.comcast.net  2016-07-10 20:00:00         0.0          68.194771  -3.887704
acr01.summit.co.denver.comcast.net      2016-07-09 02:00:00         2.0         149.499231  -3.727877
ten02.beverly.ma.boston.comcast.net     2016-07-12 18:00:00         0.0          61.325129  -3.573634
cbr02.peabody.ma.boston.comcast.net     2016-07-12 18:00:00         0.0          61.072276  -3.561465

The plots above depict instances where double-stochastic assigns high severity while multi-dist deems the same points far less severe. Double-stochastic is detecting multiple instances where multi-dist fails to detect anything at all: for example, periods with a baseline of ~5 calls and an observed count of ~30-40 calls are flagged only by double-stochastic.
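The residuals column used to rank these disagreements is, in effect, a normalized gap between the two schemes' intensities. The sketch below shows one plausible way to compute such a ranking; the per-scheme z-scoring and the sign convention (multi_dist minus double_stochastic, so large negative residuals flag points only double-stochastic considers severe) are illustrative assumptions, not the notebook's exact formula:

```python
def score_residuals(ds_scores, md_scores):
    """Hypothetical reconstruction of the 'residuals' ranking: z-score each
    scheme's anomaly intensities separately, then take multi-dist minus
    double-stochastic. Large negative values mark disagreements where
    double-stochastic is far more alarmed than multi-dist."""
    def zscore(xs):
        mu = sum(xs) / len(xs)
        sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
        return [(x - mu) / sd for x in xs]
    return [m - d for d, m in zip(zscore(ds_scores), zscore(md_scores))]
```

Sorting ascending by this residual then surfaces exactly the kind of rows shown in the tables above: multi_dist near zero, double_stochastic large.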

One distinctive characteristic of double-stochastic is that it assigns incrementally lower scores to the same time-series (e.g., network_level_9_name = "ten04.everett.wa.seattle.comcast.net") over consecutive hours, since it computes the conditional probability of extreme values given what has already been observed. This is desirable from a marginal-relevance perspective.
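This decaying behaviour can be sketched abstractly. The geometric decay factor below is a hypothetical stand-in for the conditional-probability computation double-stochastic actually performs; the point is only that a run of consecutive anomalous hours earns progressively smaller scores, and the discount resets once the series returns to normal:

```python
def discounted_scores(raw_scores, decay=0.5):
    """Discount consecutive non-zero anomaly scores geometrically, as an
    illustrative analogue of conditioning on already-observed extremes.
    `decay` is an assumed parameter, not taken from the notebook."""
    out = []
    run = 0  # length of the current consecutive-anomaly run
    for s in raw_scores:
        if s > 0:
            out.append(s * decay ** run)  # later hours in a run score less
            run += 1
        else:
            out.append(0.0)
            run = 0  # a quiet hour resets the discount
    return out
```

Three identical raw scores in a row would come out as 10.0, 5.0, 2.5, while the same score after a quiet hour would again receive its full 10.0.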

Relevance and marginal relevance: marginal relevance measures whether an alarm still has distinctive usefulness after the user has examined certain other alarms (Carbonell and Goldstein, 1998). Even if an alarm is highly relevant, its information can be completely redundant with alarms already examined. The most extreme case is consecutive alarms, a phenomenon that is in fact very common. In such circumstances marginal relevance is clearly the better measure of utility to the user, and maximizing it requires returning alarms that exhibit diversity and novelty. Whether this is desirable from a network operator's point of view is still unanswered, and more depth on this question would prove helpful.
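A greedy maximal-marginal-relevance selection over alarms, in the spirit of Carbonell and Goldstein (1998), could look like the following sketch. The similarity function (1 if two alarms share an entity and fall within one hour of each other, else 0) and the λ weighting are illustrative assumptions chosen to penalize exactly the consecutive-alarm redundancy discussed above:

```python
def mmr_rank(alarms, lam=0.5, k=3):
    """Greedily select up to k alarms maximizing lam*relevance minus
    (1-lam)*max-similarity-to-already-selected (MMR). Each alarm is a
    tuple (entity, hour, score); all parameters here are illustrative."""
    def sim(a, b):
        # Redundant if same entity within one hour (assumed similarity)
        return 1.0 if a[0] == b[0] and abs(a[1] - b[1]) <= 1 else 0.0

    selected, pool = [], list(alarms)
    while pool and len(selected) < k:
        best = max(pool, key=lambda a: lam * a[2]
                   - (1 - lam) * max((sim(a, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```

With alarms [("cmts1", 0, 9.0), ("cmts1", 1, 8.0), ("cmts2", 5, 7.5)] and k=2, the second pick is the cmts2 alarm rather than the consecutive (and slightly higher-scoring) cmts1 hour, which is the diversity behaviour marginal relevance asks for.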

In [21]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[21]: